Method and apparatus for a character-based comparison of documents

ABSTRACT

A method and system for a character-based document comparison are described. In one embodiment, the method includes dividing a first document into tokens. Each token includes a predefined number of sequential characters from the first document. The method further includes calculating hash values for the tokens and creating, for the first document, a signature including a subset of hash values from the calculated hash values and additional information pertaining to the tokens of the first document. The signature of the first document is subsequently compared with a signature of a second document to determine resemblance between the first document and the second document.

RELATED APPLICATIONS

The present application claims priority to U.S. Provisional Application Ser. No. 60/471,242, filed May 15, 2003, which is incorporated herein in its entirety.

FIELD OF THE INVENTION

The present invention relates to data processing; more particularly, the present invention relates to a character-based comparison of documents.

BACKGROUND OF THE INVENTION

The Internet is growing in popularity, and more and more people are conducting business over the Internet, advertising their products and services by generating and sending electronic mass mailings. These electronic messages (emails) are usually unsolicited and regarded as nuisances by the recipients because they occupy much of the storage space needed for the necessary and important data processing. For example, a mail server may have to reject an important and/or desired email when its storage capacity is filled to the maximum with unwanted emails containing advertisements. Moreover, thin client systems such as set top boxes, PDAs, network computers, and pagers all have limited storage capacity. Unwanted emails in any one of such systems can tie up a finite resource for the user. In addition, a typical user wastes time by downloading voluminous but useless advertisement information. These unwanted emails are commonly referred to as spam.

Presently, there are products that are capable of filtering out unwanted messages. For example, a spam block method exists which keeps an index list of all spam agents (i.e., companies that generate mass unsolicited e-mails), and provides means to block any e-mail sent from a company on the list.

Another “junk mail” filter currently available employs filters which are based on predefined words and patterns. An incoming mail is designated as an unwanted mail if the subject contains a known spam pattern.

However, as spam filtering grows in sophistication, so do the techniques of spammers in avoiding the filters. Examples of tactics incorporated by the recent generation of spammers include randomization, origin concealment, and filter evasion using HTML.

SUMMARY OF THE INVENTION

A method and system for a character-based comparison of documents are described. According to one aspect, the method includes dividing a first document into tokens. Each token includes a predefined number of sequential characters from the first document. The method further includes calculating hash values for the tokens and creating, for the first document, a signature including a subset of hash values from the calculated hash values and additional information pertaining to the tokens of the first document. The signature of the first document is subsequently compared with a signature of a second document to determine resemblance between the first document and the second document.

Other features of the present invention will be apparent from the accompanying drawings and from the detailed description that follows.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the invention, which, however, should not be taken to limit the invention to the specific embodiments, but are for explanation and understanding only.

FIG. 1 is a block diagram of one embodiment of a system for controlling delivery of spam electronic mail.

FIG. 2 is a block diagram of one embodiment of a spam content preparation module.

FIG. 3 is a block diagram of one embodiment of a similarity determination module.

FIG. 4 is a flow diagram of one embodiment of a process for handling a spam message.

FIG. 5 is a flow diagram of one embodiment of a process for filtering email spam based on similarity measures.

FIG. 6A is a flow diagram of one embodiment of a process for creating a signature of an email message.

FIG. 6B is a flow diagram of one embodiment of a process for detecting spam using a signature of an email message.

FIG. 7 is a flow diagram of one embodiment of a process for a character-based comparison of documents.

FIG. 8 is a flow diagram of one embodiment of a process for determining whether two documents are similar.

FIG. 9 is a flow diagram of one embodiment of a process for reducing noise in an email message.

FIG. 10 is a flow diagram of one embodiment of a process for modifying an email message to reduce noise.

FIG. 11 is a block diagram of an exemplary computer system.

DETAILED DESCRIPTION OF THE PRESENT INVENTION

A method and apparatus for a character-based comparison of documents are described. In the following description, numerous details are set forth. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the present invention.

Some portions of the detailed descriptions which follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.

It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.

The present invention also relates to apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus.

The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.

A machine-readable medium includes any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer). For example, a machine-readable medium includes read only memory (“ROM”); random access memory (“RAM”); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other form of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.); etc.

Filtering Email Spam Based on Similarity Measures

FIG. 1 is a block diagram of one embodiment of a system for controlling delivery of spam electronic mail (email). The system includes a control center 102 coupled to a communications network 100 such as a public network (e.g., the Internet, a wireless network, etc.) or a private network (e.g., LAN, Intranet, etc.). The control center 102 communicates with multiple network servers 104 via the network 100. Each server 104 communicates with user terminals 106 using a private or public network.

The control center 102 is an anti-spam facility that is responsible for analyzing messages identified as spam, developing filtering rules for detecting spam, and distributing the filtering rules to the servers 104. A message may be identified as spam because it was sent by a known spam source (as determined, for example, using a “spam probe”, i.e., an email address specifically selected to make its way into as many spammer mailing lists as possible).

A server 104 may be a mail server that receives and stores messages addressed to users of corresponding user terminals 106. Alternatively, a server 104 may be a different server coupled to the mail server. Servers 104 are responsible for filtering incoming messages based on the filtering rules received from the control center 102.

In one embodiment, the control center 102 includes a spam content preparation module 108 that is responsible for generating data characterizing the content associated with a spam attack and sending this data to the servers 104. Each server 104 includes a similarity determination module 110 that is responsible for storing spam data received from the control center 102 and identifying incoming email messages resembling the spam content using the stored data.

In an alternative embodiment, each server 104 hosts both the spam content preparation module 108 that generates data characterizing the content associated with a spam attack and the similarity determination module 110 that uses the generated data to identify email messages resembling the spam content.

FIG. 2 is a block diagram of one embodiment of a spam content preparation module 200. The spam content preparation module 200 includes a spam content parser 202, a spam data generator 206, and a spam data transmitter 208.

The spam content parser 202 is responsible for parsing the body of email messages resulting from spam attacks (referred to as spam messages).

The spam data generator 206 is responsible for generating data characterizing a spam message. In one embodiment, data characterizing a spam message includes a list of hash values calculated for sets of tokens (e.g., characters, words, lines, etc.) composing the spam message. Data characterizing a spam message or any other email message is referred to herein as a message signature. Signatures of spam messages or any other email messages may contain various data identifying the message content and may be created using various algorithms that enable the use of similarity measures in comparing signatures of different email messages.

In one embodiment, the spam content preparation module 200 also includes a noise reduction algorithm 204 that is responsible for detecting data indicative of noise and removing the noise from spam messages prior to generating signatures of spam messages. Noise represents data invisible to a recipient that was added to a spam message to hide its spam nature.

In one embodiment, the spam content preparation module 200 also includes a message grouping algorithm (not shown) that is responsible for grouping messages originated from a single spam attack. Grouping may be performed based on specified characteristics of spam messages (e.g., included URLs, message parts, etc.). If grouping is used, the spam data generator 206 may generate a signature for a group of spam messages rather than for each individual message.

The spam data transmitter 208 is responsible for distributing signatures of spam messages to participating servers such as servers 104 of FIG. 1. In one embodiment, each server 104 periodically (e.g., every 5 minutes) initiates a connection (e.g., a secure HTTPS connection) with the control center 102. Using this pull-based connection, signatures are transmitted from the control center 102 to the relevant server 104.

FIG. 3 is a block diagram of one embodiment of a similarity determination module 300. The similarity determination module 300 includes an incoming message parser 302, a spam data receiver 306, a message data generator 310, a resemblance identifier 312, and a spam database 304.

The incoming message parser 302 is responsible for parsing the body of incoming email messages.

The spam data receiver 306 is responsible for receiving signatures of spam messages and storing them in the spam database 304.

The message data generator 310 is responsible for generating signatures of incoming email messages. In some embodiments, a signature of an incoming email message includes a list of hash values calculated for sets of tokens (e.g., characters, words, lines, etc.) composing the incoming email message. In other embodiments, a signature of an incoming email message includes various other data characterizing the content of the email message (e.g., a subset of token sets composing the incoming email message). As discussed above, signatures of email messages may be created using various algorithms that allow for use of similarity measures in comparing signatures of different email messages.

In one embodiment, the similarity determination module 300 also includes an incoming message cleaning algorithm 308 that is responsible for detecting data indicative of noise and removing the noise from the incoming email messages prior to generating their signatures, as will be discussed in more detail below.

The resemblance identifier 312 is responsible for comparing the signature of each incoming email message with the signatures of spam messages stored in the spam database 304 and determining, based on this comparison, whether an incoming email message is similar to any spam message.

In one embodiment, the spam database 304 stores signatures generated for spam messages before they undergo the noise reduction process (i.e., noisy spam messages) and signatures generated for these spam messages after they undergo the noise reduction process (i.e., spam messages with reduced noise). In this embodiment, the message data generator 310 first generates a signature of an incoming email message prior to noise reduction, and the resemblance identifier 312 compares this signature with the signatures of noisy spam messages. If this comparison indicates that the incoming email message is similar to one of these spam messages, then the resemblance identifier 312 marks this incoming email message as spam. Otherwise, the resemblance identifier 312 invokes the incoming message cleaning algorithm 308 to remove noise from the incoming email message. Then, the message data generator 310 generates a signature for the modified incoming message, which is then compared by the resemblance identifier 312 with the signatures of spam messages with reduced noise.
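The two-stage comparison in this embodiment can be illustrated with a minimal Python sketch. It assumes the cleaning step runs only when the first comparison finds no match. The function name `classify` and the three callables passed to it (`make_signature`, `remove_noise`, `similar`) are hypothetical placeholders for the message data generator 310, the cleaning algorithm 308, and the resemblance identifier 312; they are not taken from the description.

```python
from typing import Callable, Iterable, TypeVar

Sig = TypeVar("Sig")


def classify(
    message: str,
    noisy_spam_sigs: Iterable[Sig],
    clean_spam_sigs: Iterable[Sig],
    make_signature: Callable[[str], Sig],
    remove_noise: Callable[[str], str],
    similar: Callable[[Sig, Sig], bool],
) -> bool:
    """Return True if the message resembles a stored spam message."""
    # Stage 1: compare the raw message against signatures of noisy spam.
    raw_sig = make_signature(message)
    if any(similar(raw_sig, spam_sig) for spam_sig in noisy_spam_sigs):
        return True
    # Stage 2 (assumed to run only on a miss): clean the message, then
    # compare against signatures of spam with reduced noise.
    clean_sig = make_signature(remove_noise(message))
    return any(similar(clean_sig, spam_sig) for spam_sig in clean_spam_sigs)
```

Checking the noisy signatures first lets an exact replay of a known spam message be caught without paying for noise reduction.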

FIG. 4 is a flow diagram of one embodiment of a process 400 for handling a spam message. The process may be performed by processing logic that may comprise hardware (e.g., dedicated logic, programmable logic, microcode, etc.), software (such as run on a general purpose computer system or a dedicated machine), or a combination of both. In one embodiment, processing logic resides at a control center 102 of FIG. 1.

Referring to FIG. 4, process 400 begins with processing logic receiving a spam message (processing block 402).

At processing block 404, processing logic modifies the spam message to reduce noise. One embodiment of a noise reduction algorithm will be discussed in more detail below in conjunction with FIGS. 9 and 10.

At processing block 406, processing logic generates a signature of the spam message. In one embodiment, a signature of the spam message includes a list of hash values calculated for sets of tokens (e.g., characters, words, lines, etc.) composing the spam message, as will be discussed in more detail below in conjunction with FIG. 6A. In other embodiments, a signature of the spam message includes various other data characterizing the content of the message.

At processing block 408, processing logic transfers the signature of the spam message to a server (e.g., a server 104 of FIG. 1), which uses the signature of the spam message to find incoming email messages resembling the spam message (block 410).

FIG. 5 is a flow diagram of one embodiment of a process 500 for filtering email spam based on similarity measures. The process may be performed by processing logic that may comprise hardware (e.g., dedicated logic, programmable logic, microcode, etc.), software (such as run on a general purpose computer system or a dedicated machine), or a combination of both. In one embodiment, processing logic resides at a server 104 of FIG. 1.

Referring to FIG. 5, process 500 begins with processing logic receiving an incoming email message (processing block 502).

At processing block 504, processing logic modifies the incoming message to reduce noise. One embodiment of a noise reduction algorithm will be discussed in more detail below in conjunction with FIGS. 9 and 10.

At processing block 506, processing logic generates a signature of the incoming message based on the content of the incoming message. In one embodiment, a signature of an incoming email message includes a list of hash values calculated for sets of tokens (e.g., characters, words, lines, etc.) composing the incoming email message, as will be discussed in more detail below in conjunction with FIG. 6A. In other embodiments, a signature of an incoming email message includes various other data characterizing the content of the email message.

At processing block 508, processing logic compares the signature of the incoming message with signatures of spam messages.

At processing block 510, processing logic determines that the resemblance between the signature of the incoming message and a signature of some spam message exceeds a threshold similarity measure. One embodiment of a process for determining the resemblance between two messages is discussed in more detail below in conjunction with FIG. 6B.

At processing block 512, processing logic marks the incoming email message as spam.

FIG. 6A is a flow diagram of one embodiment of a process 600 for creating a signature of an email message. The process may be performed by processing logic that may comprise hardware (e.g., dedicated logic, programmable logic, microcode, etc.), software (such as run on a general purpose computer system or a dedicated machine), or a combination of both. In one embodiment, processing logic resides at a server 104 of FIG. 1.

Referring to FIG. 6A, process 600 begins with processing logic dividing an email message into sets of tokens (processing block 602). Each set of tokens may include a predefined number of sequential units from the email message. The predefined number may be equal to, or greater than, 1. A unit may represent a character, a word or a line in the email message. In one embodiment, each set of tokens is combined with the number of occurrences of this set of tokens in the email message.

At processing block 604, processing logic calculates hash values for the sets of tokens. In one embodiment, a hash value is calculated by applying a hash function to each combination of a set of tokens and a corresponding token occurrence number.

At processing block 606, processing logic creates a signature for the email message using the calculated hash values. In one embodiment, the signature is created by selecting a subset of calculated hash values and adding a parameter characterizing the email message to the selected subset of calculated hash values. The parameter may specify, for example, the size of the email message, the number of calculated hash values, the keyword associated with the email message, the name of an attachment file, etc.

In one embodiment, a signature for an email message is created using a character-based document comparison mechanism that will be discussed in more detail below in conjunction with FIGS. 7 and 8.

FIG. 6B is a flow diagram of one embodiment of a process 650 for detecting spam using a signature of an email message. The process may be performed by processing logic that may comprise hardware (e.g., dedicated logic, programmable logic, microcode, etc.), software (such as run on a general purpose computer system or a dedicated machine), or a combination of both. In one embodiment, processing logic resides at a server 104 of FIG. 1.

Referring to FIG. 6B, process 650 compares data in a signature of an incoming email message with data in a signature of each spam message. The signature data includes a parameter characterizing the content of an email message and a subset of hash values generated for the tokens contained in the email message. The parameter may specify, for example, the size of the email message, the number of tokens in the email message, the keyword associated with the email message, the name of an attachment file, etc.

Processing logic begins with comparing a parameter in a signature of the incoming email message with a corresponding parameter in a signature of each spam message (processing block 652).

At decision box 654, processing logic determines whether any spam message signatures contain a parameter similar to the parameter of the incoming message signature. The similarity may be determined, for example, based on the allowed difference between the two parameters or the allowed ratio of the two parameters.

If none of the spam message signatures contain a parameter similar to the parameter of the incoming message signature, processing logic decides that the incoming email message is legitimate (i.e., it is not spam) (processing block 662).

Alternatively, if one or more spam message signatures have a similar parameter, processing logic determines whether the signature of the first spam message has hash values similar to the hash values in the signature of the incoming email (decision box 656). Based on the similarity threshold, the hash values may be considered similar if, for example, a certain number of them match or the ratio of matched to unmatched hash values exceeds a specified threshold.

If the first spam message signature has hash values similar to the hash values of the incoming email signature, processing logic decides that the incoming email message is spam (processing block 670). Otherwise, processing logic further determines if there are more spam message signatures with the similar parameter (decision box 658). If so, processing logic determines whether the next spam message signature has hash values similar to the hash values of the incoming email signature (decision box 656). If so, processing logic decides that the incoming email message is spam (processing block 670). If not, processing logic returns to decision box 658.

If processing logic determines that no other spam message signatures have the similar parameter, then it decides that the incoming email message is not spam (processing block 662).
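The flow of process 650 can be sketched in Python as a cheap parameter pre-filter followed by a hash comparison on the surviving candidates. This is an illustration under stated assumptions, not the described implementation: the `Signature` type, the function `is_spam`, and the defaults `max_param_diff` and `min_matches` are hypothetical, and the sketch picks the "allowed difference" and "certain number of matches" criteria from the options named above.

```python
from typing import FrozenSet, Iterable, NamedTuple


class Signature(NamedTuple):
    parameter: int          # e.g., message size or token count
    hashes: FrozenSet[int]  # selected subset of hash values


def is_spam(
    incoming: Signature,
    spam_signatures: Iterable[Signature],
    max_param_diff: int = 2,   # assumed allowed parameter difference
    min_matches: int = 3,      # assumed number of hashes that must match
) -> bool:
    # Blocks 652/654: keep only spam signatures with a similar parameter.
    candidates = [
        spam for spam in spam_signatures
        if abs(spam.parameter - incoming.parameter) <= max_param_diff
    ]
    # Boxes 656/658: walk the candidates until one has similar hash values.
    for spam in candidates:
        if len(spam.hashes & incoming.hashes) >= min_matches:
            return True   # block 670: spam
    return False          # block 662: not spam
```

Because the parameter comparison is a single integer test, most stored signatures can be rejected before any hash sets are intersected.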

Character-Based Document Comparison Mechanism

FIG. 7 is a flow diagram of one embodiment of a process 700 for a character-based comparison of documents. The process may be performed by processing logic that may comprise hardware (e.g., dedicated logic, programmable logic, microcode, etc.), software (such as run on a general purpose computer system or a dedicated machine), or a combination of both.

Referring to FIG. 7, process 700 begins with processing logic pre-processing a document (processing block 702). In one embodiment, the document is pre-processed by changing each upper case alphabetic character within the document to a lower case alphabetic character. For example, the message “I am Sam, Sam I am.” may be pre-processed into an expression “i.am.sam.sam.i.am”. Note that in this example each run of whitespace and punctuation between words also appears as a single period.

At processing block 704, processing logic divides the document into tokens, with each token including a predefined number of sequential characters from the document. In one embodiment, each token is combined with its occurrence number. This combination is referred to as a labeled shingle. For example, if the predefined number of sequential characters in the token is equal to 3, the expression specified above includes the following set of labeled shingles:

-   -   i.a1
    -   .am1
    -   am.1
    -   m.s1
    -   .sa1
    -   sam1
    -   am.2
    -   m.s2
    -   .sa2
    -   sam2
    -   am.3
    -   m.i1
    -   .i.1
    -   i.a2
    -   .am2

In one embodiment, the shingles are represented as a histogram.
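The pre-processing and shingling steps can be sketched in Python as follows. The function names `preprocess` and `labeled_shingles` are hypothetical. The rule in `preprocess` that collapses each run of non-alphanumeric characters into one period is an assumption inferred from the “i.am.sam.sam.i.am” example; the description itself only mentions lower-casing.

```python
import re
from collections import Counter
from typing import List


def preprocess(text: str) -> str:
    """Lower-case the text; assumed: collapse non-alphanumerics to '.'."""
    collapsed = re.sub(r"[^a-z0-9]+", ".", text.lower())
    return collapsed.strip(".")


def labeled_shingles(text: str, k: int = 3) -> List[str]:
    """Slide a k-character window and tag each token with its occurrence number."""
    seen: Counter = Counter()   # running histogram of tokens seen so far
    shingles = []
    for i in range(len(text) - k + 1):
        token = text[i:i + k]
        seen[token] += 1
        shingles.append(f"{token}{seen[token]}")
    return shingles
```

For the 17-character example expression and k = 3, this produces 15 labeled shingles, beginning with `i.a1` and ending with `.am2`. The `Counter` doubles as the histogram representation mentioned above.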

At processing block 706, processing logic calculates hash values for the tokens. In one embodiment, the hash values are calculated for the labeled shingles. For example, if a hashing function H(x) is applied to each labeled shingle illustrated above, the following results are produced:

-   -   H(i.a1)->458348732
    -   H(.am1)->200404023
    -   H(am.1)->692939349
    -   H(m.s1)->220443033
    -   H(.sa1)->554034022
    -   H(sam1)->542929292
    -   H(am.2)->629292229
    -   H(m.s2)->702202232
    -   H(.sa2)->322243349
    -   H(sam2)->993923828
    -   H(am.3)->163393269
    -   H(m.i1)->595437753
    -   H(.i.1)->843438583
    -   H(i.a2)->244485639
    -   H(.am2)->493869359

In one embodiment, processing logic then sorts the hash values as follows:

-   -   163393269
    -   200404023
    -   220443033
    -   244485639
    -   322243349
    -   458348732
    -   493869359
    -   542929292
    -   554034022
    -   595437753
    -   629292229
    -   692939349
    -   702202232
    -   843438583
    -   993923828

At processing block 708, processing logic selects a subset of hash values from the calculated hash values. In one embodiment, processing logic selects the X smallest values from the sorted hash values and creates from them a “sketch” of the document. For example, for X=4, the sketch can be expressed as follows:

-   -   [163393269 200404023 220443033 244485639]

At processing block 710, processing logic creates a signature of the document by adding to the sketch a parameter pertaining to the tokens of the document. In one embodiment, the parameter specifies the number of original tokens in the document. In the example above, the number of original tokens is 15. Hence, the signature of the document can be expressed as follows:

-   -   [15 163393269 200404023 220443033 244485639].

Alternatively, the parameter may specify any other characteristic of the content of the document (e.g., the size of the document, the keyword associated with the document, etc.).
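Blocks 706 through 710 can be sketched in Python as below. The description does not name the hashing function H(x), so the sketch assumes the first four bytes of a SHA-1 digest; its outputs therefore differ from the illustrative values listed above. The names `hash_shingle` and `make_signature` are hypothetical.

```python
import hashlib
from typing import List, Sequence


def hash_shingle(shingle: str) -> int:
    """Assumed stand-in for H(x): first 4 bytes of SHA-1 as an unsigned int."""
    digest = hashlib.sha1(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def make_signature(shingles: Sequence[str], sketch_size: int = 4) -> List[int]:
    """Return [token count, X smallest hash values in ascending order]."""
    hashes = sorted(hash_shingle(s) for s in shingles)   # blocks 706 and sort
    sketch = hashes[:sketch_size]                        # block 708
    return [len(shingles)] + sketch                      # block 710
```

Keeping only the smallest X hashes gives a fixed-size sketch regardless of document length, while the leading token count preserves a rough size indicator for later comparison.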

FIG. 8 is a flow diagram of one embodiment of a process 800 for determining whether two documents are similar. The process may be performed by processing logic that may comprise hardware (e.g., dedicated logic, programmable logic, microcode, etc.), software (such as run on a general purpose computer system or a dedicated machine), or a combination of both.

Referring to FIG. 8, process 800 begins with processing logic comparing the token numbers specified in the signatures of documents 1 and 2, and determining whether the token number in the first signature is within the allowed range with respect to the token number from the second signature (decision box 802). For example, the allowed range may be a difference of 1 or less or a ratio of 90 percent or higher.

If the token number in the first signature is outside of the allowed range with respect to the token number from the second signature, processing logic decides that documents 1 and 2 are different (processing block 808). Otherwise, if the token number in the first signature is within the allowed range with respect to the token number from the second signature, processing logic determines whether the resemblance between hash values in signatures 1 and 2 exceeds a threshold (e.g., more than 95 percent of hash values are the same) (decision box 804). If so, processing logic decides that the two documents are similar (processing block 806). If not, processing logic decides that documents 1 and 2 are different (processing block 808).
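A minimal Python sketch of process 800 follows, assuming a signature is a list whose first element is the token count and whose remaining elements are the sketch hashes. The function name `documents_similar` is hypothetical, the ratio form of the allowed range is chosen over the difference form, and measuring hash resemblance as shared hashes divided by the larger sketch size is an assumption; the description leaves the exact measure open.

```python
from typing import Sequence


def documents_similar(
    sig_a: Sequence[int],
    sig_b: Sequence[int],
    min_count_ratio: float = 0.90,    # "ratio of 90 percent or higher"
    min_hash_overlap: float = 0.95,   # "more than 95 percent ... the same"
) -> bool:
    count_a, count_b = sig_a[0], sig_b[0]
    larger = max(count_a, count_b)
    # Decision box 802: token counts must be within the allowed range.
    if larger == 0 or min(count_a, count_b) / larger < min_count_ratio:
        return False                  # block 808: different
    hashes_a, hashes_b = set(sig_a[1:]), set(sig_b[1:])
    if not hashes_a or not hashes_b:
        return False                  # nothing to compare
    # Decision box 804: fraction of shared hash values (assumed measure).
    overlap = len(hashes_a & hashes_b) / max(len(hashes_a), len(hashes_b))
    return overlap > min_hash_overlap
```

With a four-value sketch, a 95 percent threshold effectively requires all four hashes to match; a larger sketch size would allow a finer-grained resemblance score.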

Email Spam Filtering Using Noise Reduction

FIG. 9 is a flow diagram of one embodiment of a process 900 for reducingnoise in an email message. The process may be performed by processinglogic that may comprise hardware (e.g., dedicated logic, programmablelogic, microcode, etc.), software (such as run on a general purposecomputer system or a dedicated machine), or a combination of both.

Referring to FIG. 9, process 900 begins with processing logic detectingin an email message data indicative of noise (processing block 902). Asdiscussed above, noise represents data that is invisible to a recipientof the mail message and was added to the email message to avoid spamfiltering. Such data may include, for example, formatting data (e.g.,HTML tags), numeric character references, character entity references,URL data of predefined categories, etc. Numeric character referencesspecify the code position of a character in the document character set.Character entity references use symbolic names so that authors need notremember code positions. For example, the character entity reference&aring refers to the lowercase “a” character topped with a ring.

At processing block 904, processing logic modifies the content of theemail message to reduce the noise. In one embodiment, the contentmodification includes removing formatting data, translating numericcharacter references and charcater entity references to their ASCIIequivalents, and modifying URL data.

At processing block 906, processing logic compares the modified contentof the email message with the content of a spam message. In oneembodiment, the comparison is performed to identify an exact match.Alternatively, the comparison is performed to determine whether the twodocuments are similar.

FIG. 10 is a flow diagram of one embodiment of a process 1000 formodifying an email message to reduce noise. The process may be performedby processing logic that may comprise hardware (e.g., dedicated logic,programmable logic, microcode, etc.), software (such as run on a generalpurpose computer system or a dedicated machine), or a combination ofboth.

Referring to FIG. 10, process 1000 begins with processing logicsearching an email message for formatting data (e.g., HTML tags)(processing block 1002).

At decision box 1004, processing logic determines whether the foundformatting data qualifies as an exception. Typically, HTML formattingdoes not add anything to the information content of a message. However,a few exceptions exist. These exceptions are the tags that containuseful information for further processing of the message (e.g., tags<BODY>, <A>, <IMG>, and <FONT>). For example, the <FONT> and <BODY> tagsare needed for “white on white” text elimination, and the <A> and <IMG>tags typically contain link information that may be used for passingdata to other components of the system.

If the formatting data does not qualify as an exception, the formatting data is extracted from the email message (processing block 1006).
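The exception check at decision box 1004 and the extraction at processing block 1006 might be sketched in Python as follows. The regular expression is a deliberately simplified tag pattern assumed for illustration (it does not handle comments, scripts, or a “>” inside attribute values), and the names EXCEPTION_TAGS and strip_formatting are hypothetical.

```python
import re

# Tags named in the description as exceptions; all other tags are removed.
EXCEPTION_TAGS = {"body", "a", "img", "font"}

# Simplified tag pattern (an assumption): "<", optional "/", a tag name,
# then any characters up to the closing ">".
TAG_RE = re.compile(r"</?([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def strip_formatting(html_text):
    """Remove HTML tags unless the tag name is one of the exceptions."""
    def keep_or_drop(match):
        name = match.group(1).lower()
        return match.group(0) if name in EXCEPTION_TAGS else ""
    return TAG_RE.sub(keep_or_drop, html_text)
```

For instance, applying strip_formatting to a fragment that wraps a link in <p> and <b> tags leaves the text and the <A> element in place while dropping the other tags.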

Next, processing logic converts each numeric character reference and character entity reference into a corresponding ASCII character (processing block 1008).

In HTML, numeric character references may take two forms:

1. The syntax “&#D;”, where D is a decimal number, refers to the ISO 10646 decimal character number D; and
2. The syntax “&#xH;” or “&#XH;”, where H is a hexadecimal number, refers to the ISO 10646 hexadecimal character number H. Hexadecimal numbers in numeric character references are case-insensitive.

For example, randomized characters in the body may appear as the following expression:

Th&#101&#32&#83a&#118&#105n&#103&#115R&#101&#103 is &#116e&#114&#119&#97&#110&#116&#115&#32yo&#117.

This expression has the meaning of the phrase “The SavingsRegister wants you.”
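The two syntaxes above can be decoded with a short Python sketch such as the one below. The function name decode_numeric_refs is hypothetical, and treating the trailing “;” as optional is an assumption made because the references in the example expression omit it.

```python
import re

# Decimal form "&#D;" or hexadecimal form "&#xH;" / "&#XH;".
# The trailing ";" is treated as optional (an assumption).
NUMERIC_REF_RE = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));?")


def decode_numeric_refs(text):
    """Replace each numeric character reference with its character."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            return chr(code)
        except (ValueError, OverflowError):
            # Leave references outside the valid code range untouched.
            return match.group(0)
    return NUMERIC_REF_RE.sub(replace, text)
```

Applied to the opening fragment of the example, “Th&#101&#32&#83a&#118&#105n&#103&#115”, this sketch yields “The Savings”.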

Sometimes the conversion performed at processing block 1008 may need to be repeated. For example, the string “&#38;” corresponds to the string “&” in ASCII, the string “&#35;” corresponds to the string “#” in ASCII, the string “&#51;” corresponds to “3” in ASCII, the string “&#56;” corresponds to “8” in ASCII, and the string “&#59;” corresponds to the string “;” in ASCII. Hence, the combined string “&#38;&#35;&#51;&#56;&#59;”, when converted, results in the string “&#38;”, which itself needs to be converted.

Accordingly, after the first conversion operation at processing block 1008, processing logic checks whether the converted data still includes numeric character references or character entity references (decision box 1010). If the check is positive, processing logic repeats the conversion operation at processing block 1008. Otherwise, processing logic proceeds to processing block 1012.
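The loop between processing block 1008 and decision box 1010 can be approximated in Python as shown below. This sketch relies on the standard library function html.unescape, which decodes both numeric character references and character entity references in a single pass, and it repeats until the text stops changing rather than explicitly testing for remaining references. The function name decode_until_stable and the max_passes bound are assumptions added for illustration, and html.unescape may produce non-ASCII characters (for example for &aring;), which the description does not address.

```python
import html


def decode_until_stable(text, max_passes=10):
    """Decode character references repeatedly until nothing changes."""
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            # No reference was converted in this pass, so stop.
            break
        text = decoded
    return text
```

With the combined string from the example above, the first pass produces “&#38;” and the second pass produces “&”.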

At processing block 1012, processing logic modifies URL data of predefined categories. These categories may include, for example, numeric character references contained in the URL, which processing logic converts into corresponding ASCII characters. In addition, the URL “password” syntax may be used to add characters before an “@” in the URL hostname. These characters are ignored by the target web server but they add significant amounts of noise information to each URL. Processing logic modifies the URL data by removing these additional characters. Finally, processing logic removes the “query” part of the URL, that is, the string following a “?” at the end of the URL.

An example of a URL is as follows:

http%3a%2f%2flotsofjunk@www.foo.com%2fbar.html?muchmorejunk

Processing logic modifies the above URL data into http://www.foo.com/bar.html.
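The URL modifications of processing block 1012 might look like the following Python sketch, built on the standard urllib.parse module. It assumes that the encoded characters in the URL are percent-escapes, as in the example, and the function name normalize_url is hypothetical. Malformed URLs (for example, an invalid port) are not handled here.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def normalize_url(raw_url):
    """Decode escapes, drop characters before "@", and drop the query."""
    # Convert percent-escapes such as %3a and %2f into plain characters.
    decoded = unquote(raw_url)
    parts = urlsplit(decoded)
    # hostname excludes any "user:password@" prefix in the network location.
    host = parts.hostname or ""
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    # Rebuild the URL with an empty query component.
    return urlunsplit((parts.scheme, host, parts.path, "", parts.fragment))
```

Given the example URL above, this sketch returns http://www.foo.com/bar.html.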

An Exemplary Computer System

FIG. 11 is a block diagram of an exemplary computer system 1100 that may be used to perform one or more of the operations described herein. In alternative embodiments, the machine may comprise a network router, a network switch, a network bridge, a Personal Digital Assistant (PDA), a cellular telephone, a web appliance or any machine capable of executing a sequence of instructions that specify actions to be taken by that machine.

The computer system 1100 includes a processor 1102, a main memory 1104 and a static memory 1106, which communicate with each other via a bus 1108. The computer system 1100 may further include a video display unit 1110 (e.g., a liquid crystal display (LCD) or a cathode ray tube (CRT)). The computer system 1100 also includes an alpha-numeric input device 1112 (e.g., a keyboard), a cursor control device 1114 (e.g., a mouse), a disk drive unit 1116, a signal generation device 1120 (e.g., a speaker) and a network interface device 1122.

The disk drive unit 1116 includes a computer-readable medium 1124 on which is stored a set of instructions (i.e., software) 1126 embodying any one, or all, of the methodologies described above. The software 1126 is also shown to reside, completely or at least partially, within the main memory 1104 and/or within the processor 1102. The software 1126 may further be transmitted or received via the network interface device 1122. For the purposes of this specification, the term “computer-readable medium” shall be taken to include any medium that is capable of storing or encoding a sequence of instructions for execution by the computer and that causes the computer to perform any one of the methodologies of the present invention. The term “computer-readable medium” shall accordingly be taken to include, but not be limited to, solid-state memories, optical and magnetic disks, and carrier wave signals.

Whereas many alterations and modifications of the present invention will no doubt become apparent to a person of ordinary skill in the art after having read the foregoing description, it is to be understood that any particular embodiment shown and described by way of illustration is in no way intended to be considered limiting. Therefore, references to details of various embodiments are not intended to limit the scope of the claims which in themselves recite only those features regarded as essential to the invention.

1. A method comprising: dividing a first document into a plurality of tokens, each token including a predefined number of sequential characters from the first document; calculating a plurality of hash values for the plurality of tokens; and creating, for the first document, a signature including a subset of hash values from the plurality of hash values and additional information pertaining to the plurality of tokens of the first document, the signature of the first document being subsequently compared with a signature of a second document to determine resemblance between the first document and the second document.
2. The method of claim 1 further comprising: prior to dividing the first document into the plurality of tokens, changing each upper case alphabetic character within the first document to a lower case alphabetic character, and changing each non-alphabetic character within the first document to a single predefined non-alphabetic character.
3. The method of claim 1 wherein the first document is a first email message and the second document is a second email message.
4. The method of claim 1 wherein calculating the plurality of hash values for the plurality of tokens comprises: creating a shingle for each of the plurality of tokens by combining said each of the plurality of tokens with a number of occurrences of said each of the plurality of tokens within the first document; and applying a hashing function to each created shingle.
5. The method of claim 4 wherein shingles created for the plurality of tokens are represented as a histogram.
6. The method of claim 1 wherein the predefined number of sequential characters in each token is equal to three.
7. The method of claim 1 wherein the additional information pertaining to the plurality of tokens comprises a number of the plurality of tokens contained in the first document.
8. The method of claim 1 wherein creating the signature for the first document comprises: sorting the plurality of hash values; and selecting a predefined number of smallest hash values from the sorted plurality of hash values.
9. The method of claim 7 further comprising: determining whether the number of the plurality of tokens contained in the first document is within an allowed range with respect to a number of a plurality of tokens contained in the second document; and if the number of the plurality of tokens contained in the first document is not within the defined range from the number of the plurality of tokens contained in the second document, deciding that the first document does not resemble the second document.
10. The method of claim 9 further comprising: determining that the number of the plurality of tokens contained in the first document is within the defined range from the number of the plurality of tokens contained in the second document; determining whether the subset of hash values contained in the signature of the first document is similar to a subset of hash values contained in the signature of the second document; and if the subset of hash values contained in the signature of the first document is similar to the subset of hash values contained in the signature of the second document, deciding that the first document resembles the second document.
11. The method of claim 10 wherein the second email message is a spam email message.
12. The method of claim 11 further comprising: marking the first email message as spam upon deciding that the first document resembles the second document.
13. A system comprising: a parser to divide a first document into a plurality of tokens, each token including a predefined number of sequential characters from the first document; and a message data generator to calculate a plurality of hash values for the plurality of tokens, and to create, for the first document, a signature including a subset of hash values from the plurality of hash values and additional information pertaining to the plurality of tokens of the first document, the signature of the first document being subsequently compared with a signature of a second document to determine resemblance between the first document and the second document.
14. The system of claim 13 wherein the message data generator is further to change each upper case alphabetic character within the first document to a lower case alphabetic character, and to change each non-alphabetic character within the first document to a single predefined non-alphabetic character.
15. The system of claim 13 wherein the first document is a first email message and the second document is a second email message.
16. The system of claim 13 wherein the message data generator is to calculate the plurality of hash values for the plurality of tokens by creating a shingle for each of the plurality of tokens by combining said each of the plurality of tokens with a number of occurrences of said each of the plurality of tokens within the first document, and applying a hashing function to each created shingle.
17. The system of claim 13 wherein the predefined number of sequential characters in each token is equal to three.
18. The system of claim 13 wherein the additional information pertaining to the plurality of tokens comprises a number of the plurality of tokens contained in the first document.
19. The system of claim 13 wherein the message data generator is to create the signature for the first document by sorting the plurality of hash values, and selecting a predefined number of smallest hash values from the sorted plurality of hash values.
20. The system of claim 18 further comprising a resemblance identifier to determine whether the number of the plurality of tokens contained in the first document is within an allowed range with respect to a number of a plurality of tokens contained in the second document, and, if the number of the plurality of tokens contained in the first document is not within the defined range from the number of the plurality of tokens contained in the second document, to decide that the first document does not resemble the second document.
21. The system of claim 20 wherein the resemblance identifier to determine that the number of the plurality of tokens contained in the first document is within the defined range from the number of the plurality of tokens contained in the second document, to determine whether the subset of hash values contained in the signature of the first document is similar to a subset of hash values contained in the signature of the second document, and, if the subset of hash values contained in the signature of the first document is similar to the subset of hash values contained in the signature of the second document, to decide that the first document resembles the second document.
22. An apparatus comprising: means for dividing a first document into a plurality of tokens, each token including a predefined number of sequential characters from the first document; means for calculating a plurality of hash values for the plurality of tokens; and means for creating, for the first document, a signature including a subset of hash values from the plurality of hash values and additional information pertaining to the plurality of tokens of the first document, the signature of the first document being subsequently compared with a signature of a second document to determine resemblance between the first document and the second document.
23. The apparatus of claim 22 wherein the predefined number of sequential characters in each token is equal to three.
24. The apparatus of claim 22 wherein the additional information pertaining to the plurality of tokens comprises a number of the plurality of tokens contained in the first document.
25. A computer readable medium comprising executable instructions which when executed on a processing system cause said processing system to perform a method comprising: dividing a first document into a plurality of tokens, each token including a predefined number of sequential characters from the first document; calculating a plurality of hash values for the plurality of tokens; and creating, for the first document, a signature including a subset of hash values from the plurality of hash values and additional information pertaining to the plurality of tokens of the first document, the signature of the first document being subsequently compared with a signature of a second document to determine resemblance between the first document and the second document.
26. The computer readable medium of claim 25 wherein the predefined number of sequential characters in each token is equal to three.
27. The computer readable medium of claim 25 wherein the additional information pertaining to the plurality of tokens comprises a number of the plurality of tokens contained in the first document.